Abstract


The success of support vector machines in binary classification relies on the fact that hinge loss 
employed in the risk minimization targets the Bayes rule. Recent research explores some extensions 
of this large margin based method to the multicategory case. We show a moment bound for the so- 
called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with 
bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding 
one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. 
Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy 


1. Introduction 


We consider multicategory classification with equal cost. Let $Y \in \{1,\dots,m\}$ denote one of the $m$
possible categories, and let $X \in \mathbb{R}^d$ be a feature. We study the classification problem, where the
goal is to predict $Y$ given $X$ with small error. Let $\{(X_i, Y_i)\}_{i=1}^n$ be an independent and identically
distributed sample from $(X,Y)$. In the binary case ($m = 2$) a classifier $f : \mathbb{R}^d \to \mathbb{R}$ can be obtained
by minimizing the empirical hinge loss

$$\frac{1}{n}\sum_{i=1}^{n} (1 - Y_i f(X_i))_+ \qquad (1)$$

over a given class of candidate classifiers $f \in \mathcal{F}$, where $(1 - Y f(X))_+ := \max(0, 1 - Y f(X))$ with
$Y \in \{\pm 1\}$. Hinge loss in combination with a reproducing kernel Hilbert space (RKHS) regularization
penalty is called the support vector machine (SVM). See, for example, Evgeniou, Pontil,
and Poggio (2000). In this paper, we examine the generalization of (1) to the multicategory case
($m > 2$). We refer to this classifier as the multi-hinge, although, instead of RKHS-regularization, we
will assume a given model class $\mathcal{F}$ satisfying a complexity constraint. We show a moment bound
for the excess multi-hinge risk based on two kinds of complexity constraints: entropy with bracketing
and empirical entropy. Obtaining such a result based on the latter is harder than finding one
based on the former. We obtain fast rates of convergence that adapt to the unknown margin.

There are two strategies to generalize the binary SVM to the multicategory SVM. One strategy 
is by solving a series of binary problems; the other is by considering all of the categories at once. 
For the first strategy, some popular methods are the one-versus-rest method and the one-versus-one 
method. The one-versus-rest method constructs $m$ binary SVM classifiers. The $j$-th classifier $f_j$
is trained taking the examples from class $j$ as positive and the examples from all other categories
as negative. A new example $x$ is assigned to the category with the largest value of $f_j(x)$. The
one-versus-one method constructs one binary SVM classifier for every pair of distinct categories,
that is, all together $m(m-1)/2$ binary SVM classifiers are constructed. The classifier $f_{ij}$ is trained
taking the examples from category $i$ as positive and the examples from category $j$ as negative. For
a new example $x$, if $f_{ij}$ classifies $x$ into category $i$ then the vote for category $i$ is increased by one.
Otherwise the vote for category $j$ is increased by one. After each of the $m(m-1)/2$ classifiers
makes its vote, $x$ is assigned to the category with the largest number of votes. See Duan and Keerthi
(2005) and the references therein for an empirical study of the performance of these methods and
their variants.
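
The two voting rules described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the toy one-dimensional scorers, and the tie-breaking rule (smallest label wins) are assumptions made here and are not taken from the cited methods.

```python
from itertools import combinations


def one_versus_rest_predict(x, scorers):
    """Assign x to the category j (1-based) whose scorer f_j gives the largest value."""
    scores = [f(x) for f in scorers]
    return 1 + max(range(len(scores)), key=lambda j: scores[j])


def one_versus_one_predict(x, pairwise, m):
    """Majority vote over the m(m-1)/2 pairwise classifiers.

    pairwise[(i, j)] with i < j returns a positive value to vote for category i
    and a non-positive value to vote for category j.
    Ties are broken in favour of the smallest label (an assumption of this sketch).
    """
    votes = {c: 0 for c in range(1, m + 1)}
    for i, j in combinations(range(1, m + 1), 2):
        if pairwise[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return max(votes, key=lambda c: (votes[c], -c))


# Hypothetical toy scorers on the real line: category c is centred at centers[c].
centers = {1: -1.0, 2: 0.0, 3: 1.0}
scorers = [lambda x, c=c: -abs(x - centers[c]) for c in (1, 2, 3)]
pairwise = {
    (i, j): (lambda x, i=i, j=j: abs(x - centers[j]) - abs(x - centers[i]))
    for i, j in combinations((1, 2, 3), 2)
}

ovr_label = one_versus_rest_predict(0.9, scorers)
ovo_label = one_versus_one_predict(0.9, pairwise, m=3)
```

In this toy setting both rules assign the point 0.9 to the category centred at 1.0; in general the two strategies need not agree.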

An all-at-once strategy for SVM loss has been proposed by several authors. For example, see
Vapnik (2000), Weston and Watkins (1999), Crammer and Singer (2000, 2001), and Guermeur 
(2002). Roughly speaking, the idea is similar to the one-versus-rest approach but all the m classifiers 
are obtained by solving one problem. (See Hsu and Lin, 2002, for details of the formulations.) Lee, 
Lin, and Wahba (2004) (see also Lee, 2002) show that the relationship of the formulations of the 
approaches above to the Bayes’ rule is not clear from the literature and that they do not always 
implement the Bayes’ rule. They propose a new approach that has good theoretical properties. That 
is, the defined loss is Bayes consistent and it provides a unifying framework for both equal and 
unequal misclassification costs. 

We consider the equal misclassification cost where a correct classification costs 0 and an incorrect
classification costs 1. The target function $f : \mathbb{R}^d \to \mathbb{R}^m$ is defined as an $m$-tuple of separating
functions with zero-sum constraint $\sum_{j=1}^m f_j(x) = 0$, for any $x \in \mathbb{R}^d$. Hence, the classifier induced by
$f$ is

$$g(f(\cdot)) = \arg\max_{j=1,\dots,m} f_j(\cdot). \qquad (2)$$


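Analogous to the binary case, when applying RKHS-regularization, each component $f_j(x)$ is considered
as an element of a RKHS $\mathcal{H}_K = \{1\} + \bar{\mathcal{H}}_K$, for all $j = 1,\dots,m$. That is, $f_j(x)$ is expressed
as $h_j(x) + b_j$ with $h_j \in \bar{\mathcal{H}}_K$ and $b_j$ some constant. To find $f(\cdot) = (f_1(\cdot),\dots,f_m(\cdot)) \in \prod_{j=1}^m \mathcal{H}_K$ with
the zero-sum constraint, the extension of SVM methodology is to minimize

$$\frac{1}{n}\sum_{i=1}^{n} \sum_{j=1,\, j\neq Y_i}^{m} \Big(f_j(X_i) + \frac{1}{m-1}\Big)_+ + \frac{\lambda}{2}\sum_{j=1}^{m} \|h_j\|^2_{\bar{\mathcal{H}}_K}. \qquad (3)$$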


Based on (3), the multi-hinge loss is now defined as

$$l(Y, f(X)) := \sum_{j=1,\, j\neq Y}^{m} \Big(f_j(X) + \frac{1}{m-1}\Big)_+ . \qquad (4)$$

The binary SVM loss (1) is a special case by taking $m = 2$. When $Y = 1$, $l(1, f(X)) = (f_2(X) + 1)_+ = (1 - f_1(X))_+$.
Similarly, when $Y = -1$, $l(-1, f(X)) = (1 + f_1(X))_+$. Thus, (4) is identical
with the binary SVM loss $(1 - Y f(X))_+$, where $f_1$ plays the same role as $f$.
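
As a quick numerical illustration of the reduction to the binary case, the following minimal Python sketch evaluates the multi-hinge loss for $m = 2$ under the zero-sum constraint $f_2 = -f_1$ and compares it with the binary hinge loss. The function names and the score value 0.3 are hypothetical choices made for this sketch; category 2 plays the role of the label $-1$.

```python
def multi_hinge_loss(y, f):
    """Multi-hinge loss: sum over j != y of (f_j + 1/(m-1))_+.

    y is a category in {1, ..., m}; f is a list of m scores.
    """
    m = len(f)
    return sum(max(0.0, f[j] + 1.0 / (m - 1)) for j in range(m) if j != y - 1)


def binary_hinge_loss(y_pm, score):
    """Binary hinge loss (1 - y * score)_+ with y in {-1, +1}."""
    return max(0.0, 1.0 - y_pm * score)


# Binary check (m = 2) with the zero-sum constraint f_2 = -f_1.
f1 = 0.3
f = [f1, -f1]
loss_cat1 = multi_hinge_loss(1, f)  # should match (1 - f1)_+
loss_cat2 = multi_hinge_loss(2, f)  # should match (1 + f1)_+
```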

Using a classifier $g$ defined as in (2), a misclassification occurs whenever $g(X) \neq Y$. Let $P$ be
the unknown underlying measure of $(X,Y)$. The prediction error of $g$ is $P(g(X) \neq Y)$. Let $p_j(x)$
denote the conditional probability of category $j$ given $x \in \mathbb{R}^d$, $j = 1,\dots,m$. The prediction error
is minimized by the Bayes classifier $g^* = \arg\max_{j=1,\dots,m} p_j$, and the smallest prediction error is

$$P(g^*(X) \neq Y).$$






The theoretical multi-hinge risk is the expectation of the empirical multi-hinge loss with respect
to the measure $P$ and is denoted by

$$R(f) = \int l(y, f(x))\, dP(x,y), \qquad (5)$$

with $l(Y, f(X))$ defined as in (4). In this setting, Bayes' rule $f^*$ is then an $m$-tuple of separating functions
with 1 in the $k$th coordinate and $-1/(m-1)$ elsewhere, whenever $k = \arg\max_{j=1,\dots,m} p_j(x)$,
$x \in \mathbb{R}^d$. Lemma 1 below shows that multi-hinge loss (4) is Bayes consistent. That is, $f^*$ minimizes
multi-hinge risk (5) over all possible classifiers. We write $R^* = R(f^*)$, the smallest possible multi-hinge
risk. Lemma 1 is an extension of Bayes consistency of the binary SVM that has been shown
by, for example, Lin (2002), Zhang (2004a) and Bartlett, Jordan, and McAuliffe (2006).


Lemma 1. The Bayes classifier $f^*$ minimizes the multi-hinge risk $R(f)$.


This lemma can be found in Lee, Lin, and Wahba (2004), Zhang (2004b,c), Tewari and Bartlett
(2005) and Zou, Zhu, and Hastie (2006). We give a self-contained proof in the Appendix for completeness.
These authors establish the conditions needed to achieve consistency for a general family
of multicategory loss functions extended from various large margin binary classifiers. They
also show that the SVM-type losses proposed by Weston and Watkins (1999) and Crammer and
Singer (2001) are not Bayes consistent. Tewari and Bartlett (2005) and Zhang (2004b,c) also
show that the convergence to zero (in probability) of the excess multi-hinge risk $R(f) - R^*$ implies
the convergence to zero with the same rate (in probability) of the excess prediction error
$P(g(f(X)) \neq Y) - P(g(f^*(X)) \neq Y)$.
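
Lemma 1 can be probed numerically at a single point $x$. Exchanging the order of summation in the conditional expectation of (4) gives the conditional risk $\sum_j (1 - p_j)(f_j + 1/(m-1))_+$, and the sketch below compares its value at the Bayes rule with its value at randomly drawn zero-sum score vectors. The probability vector, the sampling range, and all function names are assumptions of this sketch; a random search illustrates the lemma but does not prove it.

```python
import random


def conditional_risk(p, f):
    """Conditional multi-hinge risk at a fixed x: sum_j (1 - p_j) (f_j + 1/(m-1))_+."""
    m = len(p)
    return sum((1.0 - p[j]) * max(0.0, f[j] + 1.0 / (m - 1)) for j in range(m))


def bayes_rule(p):
    """m-tuple with 1 in the coordinate of the largest p_j and -1/(m-1) elsewhere."""
    m = len(p)
    k = max(range(m), key=lambda j: p[j])
    return [1.0 if j == k else -1.0 / (m - 1) for j in range(m)]


def random_zero_sum(m, rng, scale=2.0):
    """Draw m scores uniformly in [-scale, scale] and centre them so they sum to zero."""
    f = [rng.uniform(-scale, scale) for _ in range(m)]
    mean = sum(f) / m
    return [v - mean for v in f]


rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # hypothetical conditional probabilities at some x
risk_star = conditional_risk(p, bayes_rule(p))
smallest_gap = min(
    conditional_risk(p, random_zero_sum(3, rng)) - risk_star for _ in range(10000)
)
```

A non-negative `smallest_gap` means that none of the sampled competitors achieved a smaller conditional risk than the Bayes rule.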

The RKHS-regularization (3) has attracted some interest. For example, Lee and Cui (2006) 
study an algorithm for fitting the entire regularization path and Wang and Shen (2007) study the use
of the $\ell_1$ penalty in place of the $\ell_2$ penalty. In this paper, we will not study the RKHS-regularization but
we take the minimization of the empirical multi-hinge loss over a given class of candidate classifiers 
F satisfying a complexity constraint. That is, we do not invoke a penalization technique. 


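Let $\mathcal{F}$ be a model class of candidate classifiers. For $j = 1,\dots,m$, we assume that each $f_j$ is a
member of the same class $\mathcal{F}_0 = \{h : \mathbb{R}^d \to \mathbb{R},\ h \in L_2(Q)\}$, with $Q$ the unknown marginal distribution
of $X$. That is,

$$\mathcal{F} = \Big\{ f = (f_1,\dots,f_m) : \sum_{j=1}^{m} f_j = 0,\ f_j \in \mathcal{F}_0 \Big\}. \qquad (6)$$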


Let $P_n$ be the empirical distribution of $(X,Y)$ based on the observations $\{(X_i, Y_i)\}_{i=1}^n$ and $Q_n$ the
corresponding empirical distribution of $X$ based on $X_1,\dots,X_n$. We endow $\mathcal{F}$ with the following
squared semi-metrics

$$\|f - \tilde f\|^2_{2,Q} = \sum_{j=1}^{m} \int |f_j - \tilde f_j|^2 \, dQ, \quad \text{and}$$

$$\|f - \tilde f\|^2_{2,Q_n} = \sum_{j=1}^{m} \frac{1}{n}\sum_{i=1}^{n} |f_j(X_i) - \tilde f_j(X_i)|^2,$$

for all $f, \tilde f \in \mathcal{F}$. We impose a complexity constraint on the class $\mathcal{F}_0$, in terms of either the entropy
with bracketing or the empirical entropy. Below we give the definitions of the entropies.

Definition of entropy. Let $\mathcal{G}$ be a subset of a metric space $(\Lambda, d)$. Let

$$H(\epsilon, \mathcal{G}, d) := \log N(\epsilon, \mathcal{G}, d), \quad \text{for all } \epsilon > 0,$$

where $N(\epsilon, \mathcal{G}, d)$ is the smallest value of $N$ for which there exist functions $g_1,\dots,g_N$ in $\mathcal{G}$, such that
for each $g \in \mathcal{G}$, there is a $j = j(g) \in \{1,\dots,N\}$, such that

$$d(g, g_j) \le \epsilon.$$

Then $N(\epsilon, \mathcal{G}, d)$ is called the $\epsilon$-covering number of $\mathcal{G}$ and $H(\epsilon, \mathcal{G}, d)$ is called the $\epsilon$-entropy of $\mathcal{G}$
(for the $d$-metric).
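
For intuition, a covering can be constructed greedily on a finite set. The sketch below is an assumption-laden illustration rather than part of the theory: the toy class is a grid of real numbers, the metric is the absolute difference, and the greedy rule only yields a valid $\epsilon$-cover, so its size is an upper bound on the covering number, not necessarily the smallest $N$.

```python
import math


def greedy_cover(points, eps, dist):
    """Return centers chosen from `points` so that every point is within eps of a center."""
    centers = []
    for g in points:
        # g becomes a new center only if no existing center already covers it.
        if all(dist(g, c) > eps for c in centers):
            centers.append(g)
    return centers


points = [i / 10 for i in range(11)]  # toy class G: 0.0, 0.1, ..., 1.0
dist = lambda a, b: abs(a - b)        # the metric d
centers = greedy_cover(points, eps=0.25, dist=dist)
entropy_upper_bound = math.log(len(centers))  # upper bound on H(eps, G, d)
```

Here the greedy rule selects four centers, whereas three suitably placed centers (for example 0.2, 0.7 and 1.0) already cover the grid, which shows that the greedy count can overshoot the true covering number.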


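Definition of entropy with bracketing. Let $\mathcal{G}$ be a subset of a metric space $(\Lambda, d)$ of real-valued
functions. Let

$$H_B(\epsilon, \mathcal{G}, d) := \log N_B(\epsilon, \mathcal{G}, d), \quad \text{for all } \epsilon > 0,$$

where $N_B(\epsilon, \mathcal{G}, d)$ is the smallest value of $N$ for which there exist pairs of functions
$\{[g_1^L, g_1^U],\dots,[g_N^L, g_N^U]\}$ such that $d(g_j^L, g_j^U) \le \epsilon$ for all $j = 1,\dots,N$, and such that for each $g \in \mathcal{G}$,
there is a $j = j(g) \in \{1,\dots,N\}$ such that

$$g_j^L \le g \le g_j^U.$$

Then $N_B(\epsilon, \mathcal{G}, d)$ is called the $\epsilon$-covering number with bracketing of $\mathcal{G}$ and $H_B(\epsilon, \mathcal{G}, d)$ is called the
$\epsilon$-entropy with bracketing of $\mathcal{G}$ (for the $d$-metric).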


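Let $H_B(\epsilon, \mathcal{F}_0, L_2(Q))$ and $H(\epsilon, \mathcal{F}_0, L_2(Q_n))$ denote the $\epsilon$-entropy with bracketing and the empirical
$\epsilon$-entropy of the class $\mathcal{F}_0$, respectively. The complexity of a model class can be summarized in a
complexity parameter $\rho \in (0, 1)$. Let $A$ be some positive constant. We consider classes $\mathcal{F}_0$ satisfying
one of the following complexity constraints:

$$H_B(\epsilon, \mathcal{F}_0, L_2(Q)) \le A\epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0, \text{ or}$$

$$H(\epsilon, \mathcal{F}_0, L_2(Q_n)) \le A\epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0, \text{ a.s. for all } n \ge 1.$$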

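It is straightforward to show that for all $\epsilon > 0$:

$$H_B(\epsilon, \mathcal{F}, \|\cdot\|_{2,Q}) \le (m-1)\, H_B(\epsilon (m-1)^{-1/2}, \mathcal{F}_0, L_2(Q)),$$

$$H(\epsilon, \mathcal{F}, \|\cdot\|_{2,Q_n}) \le (m-1)\, H(\epsilon (2(m-1))^{-1/2}, \mathcal{F}_0, L_2(Q_n)).$$

We define the minimizer of the empirical multi-hinge loss (without penalty)

$$\hat f_n := \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1,\, j \neq Y_i}^{m} \Big(f_j(X_i) + \frac{1}{m-1}\Big)_+ , \qquad (7)$$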





where the model class F defined as in (6) satisfies either an entropy with bracketing constraint or 
an empirical entropy constraint described above. 
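
To make the empirical minimization concrete, here is a minimal sketch of penalty-free minimization of the empirical multi-hinge loss over a small finite set of candidates. The finite candidate set, the threshold rule, the toy sample, and all names are assumptions of this sketch; the classes considered in the paper are generally infinite and are controlled through their entropy.

```python
def empirical_multi_hinge_risk(f, sample):
    """Average multi-hinge loss of f over the sample.

    f maps a feature x to a list of m scores; sample is a list of (x, y), y in {1, ..., m}.
    """
    total = 0.0
    for x, y in sample:
        scores = f(x)
        m = len(scores)
        total += sum(max(0.0, scores[j] + 1.0 / (m - 1)) for j in range(m) if j != y - 1)
    return total / len(sample)


def empirical_minimizer(candidates, sample):
    """Pick the candidate with the smallest empirical multi-hinge risk (no penalty)."""
    return min(candidates, key=lambda f: empirical_multi_hinge_risk(f, sample))


def code(k, m=3):
    """Zero-sum score vector with 1 in coordinate k and -1/(m-1) elsewhere."""
    return [1.0 if j == k - 1 else -1.0 / (m - 1) for j in range(m)]


# Two hypothetical candidates on the real line with m = 3 categories.
def threshold_rule(x):
    return code(1) if x < 0 else code(2) if x < 1 else code(3)


def constant_rule(x):
    return code(1)


sample = [(-1.0, 1), (0.5, 2), (2.0, 3)]
risk_threshold = empirical_multi_hinge_risk(threshold_rule, sample)
risk_constant = empirical_multi_hinge_risk(constant_rule, sample)
f_hat = empirical_minimizer([constant_rule, threshold_rule], sample)
```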

Besides the model class complexity, the rate of convergence also depends on the so-called margin
condition (see Condition A below) that quantifies the identifiability of the Bayes rule and is
summarized in a margin parameter (or noise level) $\kappa \ge 1$. In Tarigan and van de Geer (2006), a
probability inequality has been obtained for $\ell_1$-penalized excess hinge risk in the binary case that
adapts to the unknown parameters. In this paper, we show a moment bound for the excess multi-hinge
risk $R(\hat f_n) - R^*$ of $\hat f_n$ over the model class $\mathcal{F}$ with rate of convergence $n^{-\kappa/(2\kappa-1+\rho)}$, which is
faster than $n^{-1/2}$.
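
The claim that the rate beats $n^{-1/2}$ amounts to the exponent $\kappa/(2\kappa-1+\rho)$ exceeding $1/2$, which holds because $\rho < 1$. The short sketch below tabulates the exponent on a hypothetical grid of parameters; the grid values and names are chosen here only for illustration.

```python
def rate_exponent(kappa, rho):
    """Exponent a in the rate n^(-a), with a = kappa / (2*kappa - 1 + rho)."""
    return kappa / (2.0 * kappa - 1.0 + rho)


# Hypothetical grid: margin parameters kappa >= 1 and complexity parameters 0 < rho < 1.
grid = [(kappa, rho) for kappa in (1.0, 1.5, 2.0, 5.0, 50.0) for rho in (0.1, 0.5, 0.9)]
exponents = {pair: rate_exponent(*pair) for pair in grid}
```

On this grid the exponent is largest for $\kappa = 1$, where it equals $1/(1+\rho)$, and it approaches $1/2$ as $\kappa$ grows.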

In Section 2 we present our main result based on the margin and complexity conditions. The 
proof of the main result is given in Section 3, together with our supporting lemmas. For the sake 
of completeness and to avoid distraction, we place the proof of some supporting lemmas in the 
Appendix. 


2. A Moment Bound for Multi-hinge Classifiers 


We first state the margin and the complexity conditions. 


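Condition A (Margin condition). There exist constants $\sigma > 0$ and $\kappa \ge 1$ such that for all $f \in \mathcal{F}$,

$$R(f) - R^* \ge \Big( \frac{1}{\sigma} \sum_{j=1}^{m} \int |f_j - f_j^*| \, dQ \Big)^{\kappa}.$$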


Condition B1 (Complexity constraint under $\epsilon$-entropy with bracketing). Let $0 < \rho < 1$ and let $A$
be a positive constant. The $\epsilon$-entropy with bracketing satisfies the inequality

$$H_B(\epsilon, \mathcal{F}_0, L_2(Q)) \le A\epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0.$$

Condition B2 (Complexity constraint under empirical $\epsilon$-entropy). Let $0 < \rho < 1$ and let $A$ be a
positive constant. The empirical $\epsilon$-entropy, almost surely for all $n \ge 1$, satisfies the inequality

$$H(\epsilon, \mathcal{F}_0, L_2(Q_n)) \le A\epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0.$$


Now we come to the main result. 


Theorem 2. Assume Condition A is met and that $|f_j - f_j^*| \le M$ for all $j = 1,\dots,m$, and all
$f = (f_1,\dots,f_m) \in \mathcal{F}$. Let $\hat f_n$ be the multi-hinge loss minimizer defined in (7). Suppose that either
Condition B1 or Condition B2 holds. Then for small values of $\delta > 0$,

$$\mathbb{E}[R(\hat f_n) - R^*] \le \frac{1+\delta}{1-\delta} \inf\Big\{ R(f) - R^* + C_0\, n^{-\kappa/(2\kappa-1+\rho)} : f \in \mathcal{F} \Big\},$$

with $C_0$ some constant depending only on $m$, $M$, $\kappa$, $\sigma$, $A$ and $\rho$.

Condition A follows from a condition on the behaviour of the conditional probabilities $p_j$.
We formulate this in Condition AA below. We require that, for a fixed $x \in \mathbb{R}^d$, there is no pair of
categories having the same conditional probabilities each of which stays away from 1. Originally
the terminology "margin condition" comes from the binary case of the prediction error considered
in the work of Mammen and Tsybakov (1999) and Tsybakov (2004), where the behaviour of $p_1$,
the conditional probability of category 1, is restricted near $\{x : p_1(x) = 1/2\}$. The "margin" set
$\{x : p_1(x) = 1/2\}$ identifies the Bayes predictor which assigns a new $x$ to class 1 if $p_1(x) > 1/2$
and class 2 otherwise. The margin condition is also called the condition on the noise level, and it is
summarized in a margin parameter $\kappa$. Boucheron, Bousquet, and Lugosi (2005, Section 5.2) discuss
the noise condition and its equivalent variants, corresponding to the fast rates of convergence, in the
binary case. Thus, Condition AA is a natural extension for the multicategory case with respect to hinge loss.
Lemma 3 below gives the connection between Condition A and Condition AA. We provide the
proof in the Appendix. For $x \in \mathcal{X}$, let $p_k(x) = \max_{j \in \{1,\dots,m\}} p_j(x)$ and define


$$\tau(x) := \min_{j \neq k} \big\{ |p_j(x) - p_k(x)|,\ 1 - p_k(x) \big\}, \qquad (8)$$

where $j$ and $k$ take values in $\{1,2,\dots,m\}$.


Condition AA. Let $\tau$ be defined in (8). There exist constants $C \ge 1$ and $\gamma \ge 0$ such that $\forall z > 0$,

$$Q(\{\tau \le z\}) \le (Cz)^{1/\gamma}.$$

[Here we use the convention $(Cz)^{1/\gamma} = 1\{z \ge 1/C\}$ for $\gamma = 0$.]


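Lemma 3. Suppose Condition AA is met. Then for all $f \in \mathcal{F}$ with $|f_j - f_j^*| \le M$ for all $j = 1,\dots,m$,

$$R(f) - R^* \ge \frac{1}{\sigma_\gamma} \Big( \sum_{j=1}^{m} \int |f_j - f_j^*| \, dQ \Big)^{1+\gamma},$$

where $\sigma_\gamma = C\,(mM(1/\gamma + 1))^{\gamma}(1+\gamma)$. That is, Condition A holds with $\sigma = (\sigma_\gamma)^{1/\kappa}$ and $\kappa = 1 + \gamma$.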


Remark. In the definition of $\tau$ we have the extra piece $1 - p_k$. It is needed for technical reasons.
It forces that nowhere in the input space one class can clearly dominate. We refer to the work of
Bartlett and Wegkamp (2006, Section 4) and Tarigan and van de Geer (2006, Section 3.3.1) for
some ideas on how to get around this difficulty.


The complexity constraints B1 and B2 cover some interesting classes, including
Vapnik-Chervonenkis (VC) subgraph classes and VC convex hull classes. See, for example, van der
Vaart and Wellner (1996, Section 2.7), van de Geer (2000, Sections 2.4, 3.7, 7.4, 10.1 and 10.3) and
Song and Wellner (2002). In the situation when the approximation error $\inf_{f \in \mathcal{F}} R(f) - R^*$ is zero
(the model class $\mathcal{F}$ contains the Bayes classifier), Steinwart and Scovel (2005) obtain the same rate
of convergence for the excess hinge risk under the margin condition A and the complexity condition
B2. They consider the RKHS-regularization setting for the binary case instead.

We do not explore the behaviour of the approximation error $\inf_{f \in \mathcal{F}} R(f) - R^*$. This problem is
still open and very hard to solve even in the binary case.


3. Proof of Theorem 2 


Let $f^\circ := \arg\min_{f \in \mathcal{F}} R(f)$, the minimizer of the theoretical risk in the model class $\mathcal{F}$. As shorthand
notation we write for the loss $l_f = l_f(X,Y) = l(Y, f(X))$. We also write $\nu_n(l_f) = \sqrt{n}\,(R_n(f) - R(f))$,
with $R_n(f)$ the empirical multi-hinge risk.
Since $R_n(\hat f_n) - R_n(f) \le 0$ for all $f \in \mathcal{F}$, we have

$$R(\hat f_n) - R^* \le -[R_n(\hat f_n) - R(\hat f_n)] + [R_n(f^\circ) - R(f^\circ)] + R(f^\circ) - R^*$$
$$= -[\nu_n(l_{\hat f_n}) - \nu_n(l_{f^\circ})]/\sqrt{n} + R(f^\circ) - R^*. \qquad (9)$$

We call inequality (9) a basic inequality, following van de Geer (2000). This upper bound enables
us to work with the increments of the empirical process $\{\nu_n(l_f) - \nu_n(l_{f^\circ}) : l_f \in \mathcal{L}\}$ indexed by the
multi-hinge loss $l_f \in \mathcal{L}$, where $\mathcal{L} = \{l_f : f \in \mathcal{F}\}$.
The procedure of the proof is based on the proof of Lemma 2.1 in del Barrio et al. (2007), page
206. We write

$$Z_n(l_f) = \frac{|\nu_n(l_f) - \nu_n(l_{f^\circ})|}{\|l_f - l_{f^\circ}\|_{2,P}^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}},$$

where $(a \vee b) := \max\{a,b\}$, $\|l_f\|_{2,P}^2 = \int l_f^2(x,y)\, dP(x,y)$ and $\rho$ is from either Condition B1 or B2.
As shorthand, we also write $Z_n = Z_n(l_{\hat f_n})$. Then

$$R(\hat f_n) - R^* \le (Z_n/\sqrt{n}) \big( \|l_{\hat f_n} - l_{f^\circ}\|_{2,P}^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)} \big) + R(f^\circ) - R^*. \qquad (10)$$


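Applying the triangular inequality and Lemma 4 below gives

$$\|l_{\hat f_n} - l_{f^\circ}\|_{2,P}^{1-\rho} \le (m-1)^{(1-\rho)/2} \big( \|\hat f_n - f^*\|_{2,Q}^{1-\rho} + \|f^\circ - f^*\|_{2,Q}^{1-\rho} \big).$$

Observe that for any $f \in \mathcal{F}$ with $|f_j - f_j^*| \le M$ for all $j$, Condition A gives $\|f - f^*\|_{2,Q}^2 \le M\sigma\,(R(f) - R^*)^{1/\kappa}$. Thus,

$$\|l_{\hat f_n} - l_{f^\circ}\|_{2,P}^{1-\rho} \le C_1 \big\{ (R(\hat f_n) - R^*)^{(1-\rho)/(2\kappa)} + (R(f^\circ) - R^*)^{(1-\rho)/(2\kappa)} \big\}$$

with $C_1 = ((m-1)M\sigma)^{(1-\rho)/2}$. Denote by $\mathcal{R}_n$ the right hand side of the above inequality. Hence,
from (10) we have

$$R(\hat f_n) - R^* \le (Z_n/\sqrt{n}) \big( \mathcal{R}_n \vee n^{-(1-\rho)/(2+2\rho)} \big) + R(f^\circ) - R^*.$$

We consider first the case $(\mathcal{R}_n \vee n^{-(1-\rho)/(2+2\rho)}) = \mathcal{R}_n$. That is,

$$R(\hat f_n) - R^* \le \frac{Z_n}{\sqrt{n}}\, C_1 \big\{ (R(\hat f_n) - R^*)^{(1-\rho)/(2\kappa)} + (R(f^\circ) - R^*)^{(1-\rho)/(2\kappa)} \big\} + R(f^\circ) - R^*.$$

Two applications of Lemma 5 below yield for all $0 < \delta < 1$,

$$R(\hat f_n) - R^* \le \delta (R(\hat f_n) - R^*) + (1+\delta)(R(f^\circ) - R^*) + 2\,(C_1 Z_n/\sqrt{n})^{2\kappa/(2\kappa-1+\rho)}\, \delta^{-(1-\rho)/(2\kappa-1+\rho)}$$
$$\le \delta (R(\hat f_n) - R^*) + (1+\delta)\big( R(f^\circ) - R^* + C_2 Z_n^{r}\, n^{-\kappa/(2\kappa-1+\rho)} \big),$$

with $C_2 = 2\, C_1^{2\kappa/(2\kappa-1+\rho)}\, \delta^{-(1-\rho)/(2\kappa-1+\rho)}$ and $r = 2\kappa/(2\kappa-1+\rho)$. Now it is left to show that $\mathbb{E}[Z_n^r]$ is bounded, say
by some constant $C_3$. Then, $C_0 = C_2 C_3$ in Theorem 2.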

To show that $\mathbb{E}[Z_n^r]$ is bounded, we use an exponential tail probability of the supremum of the
weighted empirical process

$$\{Z_n(l_f) : l_f \in \mathcal{L}\}. \qquad (11)$$

We recall that $H_B(\epsilon, \mathcal{F}, \|\cdot\|_{2,Q}) \le (m-1) H_B(\epsilon(m-1)^{-1/2}, \mathcal{F}_0, L_2(Q))$. A key observation is that

$$H_B(\epsilon, \mathcal{L}, L_2(P)) \le (m-1)\, H_B(\epsilon(m-1)^{-1/2}, \mathcal{F}, \|\cdot\|_{2,Q}),$$

by Lemma 4. It gives an upper bound for the $\epsilon$-entropy with bracketing of the class $\mathcal{L}$:
$H_B(\epsilon, \mathcal{L}, L_2(P)) \le A_0 \epsilon^{-2\rho}$, for all $\epsilon > 0$, with $A_0 = A(m-1)^{2+2\rho}$. Under Condition B1, an application
of Lemma 5.14 in van de Geer (2000), presented below in Lemma 6, gives the desired
exponential tail probability. Hence, for some positive constant $c$,

$$\mathbb{E}[Z_n^r] = \int_0^\infty P(Z_n \ge t^{1/r})\, dt \le c + \int_c^\infty c \exp(-t^{1/r}/c^2)\, dt.$$

For the case $\mathcal{R}_n < n^{-(1-\rho)/(2+2\rho)}$, we have

$$R(\hat f_n) - R^* \le Z_n\, n^{-1/(1+\rho)} + R(f^\circ) - R^*.$$

We conclude by noting that $n^{-1/(1+\rho)} \le n^{-\kappa/(2\kappa-1+\rho)}$ where $\kappa \ge 1$ and $0 < \rho < 1$.

Now we consider the case where Condition B2 holds instead of B1. By virtue of the proof
above, we need only to verify an exponential tail probability of the supremum of the process (11) under
Condition B2 instead of B1. This is done by employing Lemmas 7-9 below. Again, a key observation
is that Lemma 4 and Condition B2 give us $H(\epsilon, \mathcal{L}, L_2(P_n)) \le A(m-1)^{2+2\rho}\epsilon^{-2\rho}$. □


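Lemma 4 gives an upper bound of the squared $L_2(P)$-metric of the excess loss in terms of the
$\|\cdot\|_{2,Q}$-metric.

Lemma 4. $\mathbb{E}[(l_f(X,Y) - l_{f^*}(X,Y))^2] \le (m-1) \sum_{j=1}^{m} \int |f_j - f_j^*|^2 \, dQ$.

Proof. We write $\Delta(f, f^*) = \mathbb{E}_{Y|X}[(l_f(X,Y) - l_{f^*}(X,Y))^2 \mid X = x]$ and recall that $p_j(x) = P(Y = j \mid X = x)$,
for all $j = 1,\dots,m$. We fix an arbitrary $x \in \mathbb{R}^d$. Definition of the loss gives

$$\Delta(f, f^*) = \sum_{j=1}^{m} p_j \Big( \sum_{i \neq j} \Big(f_i + \frac{1}{m-1}\Big)_+ - \sum_{i \neq j} \Big(f_i^* + \frac{1}{m-1}\Big)_+ \Big)^2$$
$$= \sum_{j=1}^{m} p_j \Big( \sum_{i \in I^+(j)} (f_i - f_i^*) + \sum_{i \in I^-(j)} \Big(-\frac{1}{m-1} - f_i^*\Big) \Big)^2,$$

where $I^+(j) = \{i \neq j : f_i \ge -1/(m-1),\ i = 1,\dots,m\}$ and $I^-(j) = \{i \neq j : f_i < -1/(m-1),\ i = 1,\dots,m\}$.
Use the facts that $(\sum_{j=1}^{n} a_j)^2 \le n \sum_{j=1}^{n} a_j^2$ for all $n \in \mathbb{N}$ and $a_j \in \mathbb{R}$, and that
$\max\{|I^+(j)|, |I^-(j)|\} \le m-1$, to obtain

$$\Delta(f, f^*) \le (m-1) \sum_{j=1}^{m} p_j \Big( \sum_{i \in I^+(j)} (f_i - f_i^*)^2 + \sum_{i \in I^-(j)} \Big(-\frac{1}{m-1} - f_i^*\Big)^2 \Big).$$

Clearly, $|-1/(m-1) - f_i^*| \le |f_i - f_i^*|$ for all $i \in I^-(j)$. Hence,

$$\Delta(f, f^*) \le (m-1) \sum_{j=1}^{m} p_j \sum_{i \neq j} |f_i - f_i^*|^2 = (m-1) \sum_{j=1}^{m} (1 - p_j)\, |f_j - f_j^*|^2,$$

where the last equality is obtained using $\sum_{j=1}^{m} p_j = 1$. We conclude the proof by bounding $1 - p_j$
with 1 for all $j$ and integrating over all $x \in \mathbb{R}^d$ with respect to the marginal distribution $Q$. □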
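
The pointwise inequality at the heart of this proof, namely $\sum_j p_j (l(j,f) - l(j,f^*))^2 \le (m-1)\sum_j (1-p_j)|f_j - f_j^*|^2$ at a fixed $x$, can be checked on random inputs. The sketch below is only a sanity check under assumed sampling ranges and hypothetical function names; it does not replace the argument above.

```python
import random


def multi_hinge_loss(y, f):
    """Multi-hinge loss: sum over j != y of (f_j + 1/(m-1))_+, with y in {1, ..., m}."""
    m = len(f)
    return sum(max(0.0, f[j] + 1.0 / (m - 1)) for j in range(m) if j != y - 1)


def bayes_rule(p):
    """m-tuple with 1 in the coordinate of the largest p_j and -1/(m-1) elsewhere."""
    m = len(p)
    k = max(range(m), key=lambda j: p[j])
    return [1.0 if j == k else -1.0 / (m - 1) for j in range(m)]


def lemma4_sides(p, f):
    """Return the conditional second moment of the excess loss and its upper bound."""
    m = len(p)
    f_star = bayes_rule(p)
    lhs = sum(
        p[j] * (multi_hinge_loss(j + 1, f) - multi_hinge_loss(j + 1, f_star)) ** 2
        for j in range(m)
    )
    rhs = (m - 1) * sum((1.0 - p[j]) * (f[j] - f_star[j]) ** 2 for j in range(m))
    return lhs, rhs


rng = random.Random(0)
m = 4
smallest_slack = float("inf")
for _ in range(5000):
    w = [rng.random() for _ in range(m)]
    p = [v / sum(w) for v in w]                      # random point of the simplex
    f = [rng.uniform(-2.0, 2.0) for _ in range(m)]
    mean = sum(f) / m
    f = [v - mean for v in f]                        # enforce the zero-sum constraint
    lhs, rhs = lemma4_sides(p, f)
    smallest_slack = min(smallest_slack, rhs - lhs)
```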


The technical lemma below is an immediate consequence of Young's inequality (see, for example,
Hardy, Littlewood, and Pólya, 1988, Chapter 8.3), using some straightforward bounds to
simplify the expressions.

Lemma 5 (Technical Lemma). For all positive $v$, $t$, $\delta$ and $\kappa > \beta$:

$$v\, t^{\beta/\kappa} \le \delta t + v^{\kappa/(\kappa-\beta)}\, \delta^{-\beta/(\kappa-\beta)}.$$
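
Reading Lemma 5 as the inequality $v\,t^{\beta/\kappa} \le \delta t + v^{\kappa/(\kappa-\beta)}\delta^{-\beta/(\kappa-\beta)}$ for positive $v$, $t$, $\delta$, $\beta$ and $\kappa > \beta$, a random search can serve as a sanity check. The sampling ranges below are assumptions of this sketch, chosen to keep the powers within floating-point range; the function name is hypothetical.

```python
import random


def lemma5_slack(v, t, delta, kappa, beta):
    """Right-hand side minus left-hand side of the Young-type inequality."""
    lhs = v * t ** (beta / kappa)
    rhs = delta * t + v ** (kappa / (kappa - beta)) * delta ** (-beta / (kappa - beta))
    return rhs - lhs


rng = random.Random(1)
smallest_slack = float("inf")
for _ in range(10000):
    beta = rng.uniform(0.01, 1.0)
    kappa = beta + rng.uniform(0.5, 3.0)  # keeps kappa - beta away from zero
    v = rng.uniform(0.01, 10.0)
    t = rng.uniform(0.01, 10.0)
    delta = rng.uniform(0.01, 10.0)
    smallest_slack = min(smallest_slack, lemma5_slack(v, t, delta, kappa, beta))
```

In the proof of Theorem 2 the lemma is applied with $\beta = (1-\rho)/2$, which turns the exponent $\kappa/(\kappa-\beta)$ into $2\kappa/(2\kappa-1+\rho)$.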


To ease the exposition, throughout Lemma 6 and Lemma 7 we write $\|\cdot\| = \|\cdot\|_{2,Q}$ and $\|\cdot\|_n = \|\cdot\|_{2,Q_n}$
for the $L_2(Q)$-norm and the $L_2(Q_n)$-norm, respectively.


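Lemma 6 (van de Geer, 2000, Lemma 5.14). For a probability measure $Q$, let $\mathcal{H}$ be a class of
uniformly bounded functions $h$ in $L_2(Q)$, say $\sup_{h \in \mathcal{H}} |h - h^\circ|_\infty \le 1$, where $h^\circ$ is a fixed but arbitrary
function in $\mathcal{H}$. Suppose that

$$H_B(\epsilon, \mathcal{H}, L_2(Q)) \le A_0 \epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0,$$

with $0 < \rho < 1$ and $A_0 > 0$. Then for some positive constants $c$ and $n_0$ depending only on $\rho$ and $A_0$,

$$P\Big( \sup_{h \in \mathcal{H}} \frac{|\nu_n(h) - \nu_n(h^\circ)|}{\|h - h^\circ\|^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}} \ge t \Big) \le c \exp(-t/c^2),$$

for all $t \ge c$ and $n \ge n_0$.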


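Lemma 7. For a probability measure $Q$ on $(\mathcal{Z}, \mathcal{A})$, let $\mathcal{H}$ be a class of uniformly bounded functions
$h$ in $L_2(Q)$, say $\sup_{h \in \mathcal{H}} |h - h^\circ|_\infty \le 1$, where $h^\circ$ is a fixed but arbitrary element in $\mathcal{H}$. Suppose that

$$H(\epsilon, \mathcal{H}, L_2(Q_n)) \le A_0 \epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0,$$

with $0 < \rho < 1$ and $A_0 > 0$. Then for some positive constants $c$ and $n_0$ depending on $\rho$ and $A_0$,

$$P\Big( \sup_{h \in \mathcal{H}} \frac{|\nu_n(h) - \nu_n(h^\circ)|}{\|h - h^\circ\|^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}} \ge t \Big) \le c \exp(-t/c^2),$$

for all $t \ge c$ and $n \ge n_0$.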


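Proof. For $n \ge (t^2/8)^{(1+\rho)/(1-\rho)}$, Chebyshev's inequality and a symmetrization technique (see, for example,
van de Geer, 2000, page 32) give

$$P\Big( \sup_{h \in \mathcal{H}} \frac{|\nu_n(h) - \nu_n(h^\circ)|}{\|h - h^\circ\|^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}} \ge t \Big)$$
$$\le 4 P\Big( \sup_{h \in \mathcal{H}} \frac{|\nu_n^\varepsilon(h) - \nu_n^\varepsilon(h^\circ)|}{\|h - h^\circ\|_n^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}} \ge \sqrt{t}/4 \Big) \qquad (12)$$
$$+\ 4 P\Big( \sup_{h \in \mathcal{H}} \frac{\|h - h^\circ\|_n^{1-\rho}}{\|h - h^\circ\|^{1-\rho} \vee n^{-(1-\rho)/(2+2\rho)}} \ge \sqrt{t}/4 \Big), \qquad (13)$$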





where $\nu_n^\varepsilon(h)$ is the symmetrized version of $\nu_n(h)$. That is, $\nu_n^\varepsilon(h) = (1/\sqrt{n}) \sum_{i=1}^{n} \varepsilon_i h(Z_i)$, where
$\{\varepsilon_i\}_{i=1}^n$ are independent random variables, independent of $\{Z_i\}_{i=1}^n$, with $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$
for all $i = 1,\dots,n$.

To handle (12), we divide the class $\mathcal{H}$ into two disjoint classes where the empirical distance
$\|h - h^\circ\|_n$ is smaller or larger than $n^{-1/(2+2\rho)}$. Write $\mathcal{H}_n = \{h \in \mathcal{H} : \|h - h^\circ\|_n \le n^{-1/(2+2\rho)}\}$. By
Lemma 5.1 in van de Geer (2000), stated below in Lemma 8, for some positive constant $c_1$,


[va (h) -va (he )| t n/P) 
ae iea = V14) Seep (— 4e? ) 








Let $J = \min\{ j \ge 1 : 2^{-j} < n^{-1/(2+2\rho)} \}$. We apply the peeling device on the set $\{h \in \mathcal{H} : 2^{-j} <
\|h - h^\circ\|_n \le 2^{-j+1},\ j = 1,\dots,J\}$ to obtain that, for all $t > 1$,


vs (h ye 
P( sup MAW VAT > vi/4|Zi,..., Zn) 
nese || — h° ||n 


J ae 
<P P( sup Ph) -via 2 n FUAR Pisa) 
j=1 heH 
]h—h? |< 2-F41 
J t 22P/ 
< ——,) < —t } 
< 2, 20x ai 2) c exp(— Je ) 


To handle (13), we use a modification of Lemma 5.6 in van de Geer (2000), stated below in
Lemma 9, where we take $t$ such that $(\sqrt{t}/4)^{1/(1-\rho)} \ge 14u$. □


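Lemma 8 (van de Geer, 2000, Lemma 5.1). Let $Z_1,\dots,Z_n,\dots$ be i.i.d. with distribution $Q$ on
$(\mathcal{Z}, \mathcal{A})$. Let $\{\varepsilon_i\}_{i=1}^n$ be independent random variables, independent of $\{Z_i\}_{i=1}^n$, with $P(\varepsilon_i = 1) =
P(\varepsilon_i = -1) = 1/2$ for all $i = 1,\dots,n$. Let $\mathcal{H} \subset L_2(Q)$ be a class of functions on $\mathcal{Z}$. Write $\nu_n^\varepsilon(h) :=
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i h(Z_i)$, with $h \in \mathcal{H}$. Let

$$\mathcal{H}(\delta) := \{h \in \mathcal{H} : \|h - h^\circ\|_{2,Q} \le \delta\}, \quad \hat\delta_n := \sup_{h \in \mathcal{H}(\delta)} \|h - h^\circ\|_{2,Q_n},$$

where $h^\circ$ is a fixed but arbitrary function in $\mathcal{H}$ and $Q_n$ is the corresponding empirical distribution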


of Z based on {Z;}"_,. Fora> ac( 1 Boyn E 1/2u, H, Qn)du V i), where C is some positive 


constant, we have 


a 
P(_ sup ia) -viO > $ 


he H(8) 





2 
a 
Zisis Z < Cex EA 
; ) Pl GAC 

The following lemma is a modification of Lemma 5.6 in van de Geer (2000).


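Lemma 9. For a probability measure $S$ on $(\mathcal{Z}, \mathcal{A})$, let $\mathcal{H}$ be a class of uniformly bounded functions
independent of $n$ with $\sup_{h \in \mathcal{H}} |h|_\infty \le 1$. Suppose that almost surely for all $n \ge 1$,

$$H(\epsilon, \mathcal{H}, L_2(S_n)) \le A_0 \epsilon^{-2\rho}, \quad \text{for all } \epsilon > 0,$$

with $0 < \rho < 1$ and $A_0 > 0$. Then, for all $n$,

$$P\Big( \sup_{h \in \mathcal{H}} \frac{\|h\|_{2,S_n}}{\|h\|_{2,S} \vee n^{-1/(2+2\rho)}} \ge 14u \Big) \le 4 \exp\big(-u^2 n^{\rho/(1+\rho)}\big),$$

for all $u \ge 1$.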


Proof. Let {5,,} be a sequence with 8, — 0, n82 — 00, nd? > 2A,H(5,) for all n with H(8,) = r”. 
We apply the randomization device in Pollard (1984, page 32), as follows. Let Zn+1,...,Zon be 
an independent copy of Z),...,Z,. Let @1,...,@, be independent random variables, independent 
of Z1,...,Zon, with P(@; = 1) = P(@; = 0) = 1/2 for all i=1,...,n. Set Z’ = Zoj-140, and 
Zi" = Zyi-o i= 1,...,n, and S,’ = (1/n) DL, zr, Sn” = (1/n) LE, bz, and Son = (Sy + Sn”) /2. 
Since the class is uniformly bounded by 1, an application of Chebyshev’s inequality gives that for 
each hin H, 








I[All2,5, 1 
P(o e <2) >1—-—— >3/4, 
I|A\|2,5 V n 4u? / 


for all u > 1. Use a symmetrization lemma of Pollard (1984, Lemma II.3.8), see Appendix, to obtain 


h h 1—|lh ” 
P( sup Ilall s, > 14u) < 2P( sup ! lz sy — llAll2,s,"| 
heH llAll2,s V 8p heH IA Ilo, V 8p 





> 12u) A 


The peeling device on the set 
{he H : (2u) tSn < ||All2s < (2u)/8, ,7 = 1,2,...} 
and the inequality in Pollard (1984, page 33) give 


h i — h UA 
P( || ll2,5, I lz s,”l 
ne llAlj2,s V òn 





> 12u | Ligh) 


P( sup ||lAlls," —|lAlls,”| > 6(2u)%6, | Z) 
hEH 
llhl|2 s< (2u)iő, 


lA 
IM: 


< 


mM: 


2exp (H(V2(2u) ðn, H San) — 2n(2u)/8;) 


na. 
Il 
rane 


IA 
M: 


2exp (H20), H, Sy!) +H ((2u) Sn, H, Sa!) —2n(2u)/8;) 
1 


nm. 
ll 
$$\le \sum_{j=1}^{\infty} 2 \exp\big( -n (2u)^{2j} \delta_n^2 \big), \qquad (14)$$


where the last inequality is obtained using that since nô? > 2A,H(6,), also nt? > 2A,H(t) for all 
t > 8, (here t = (2u)/5,). Observe that, since (2u)”/ > (2u)*j > u?j for all u > 1 and j > 1, we 
have 

Lew n(2u)*482) < 2exp(—w?n8?) , (15) 


whenever $n\delta_n^2 \ge \log 2$. We finish the proof by combining (14) and (15), and taking $\delta_n = n^{-1/(2+2\rho)}$. □


Appendix A. 


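Proof of Lemma 1. We write $L(f(x)) = \mathbb{E}_{Y|X}[l(Y, f(X)) \mid X = x]$ and recall that $p_j(x) = P(Y = j \mid X = x)$
for all $j = 1,\dots,m$, and that $f = (f_1,\dots,f_m)$ with $\sum_{j=1}^{m} f_j = 0$. Definition (4) of the loss
and the fact that $\sum_{j=1}^{m} p_j = 1$ give

$$L(f) = \sum_{j=1}^{m} (1 - p_j) \Big(f_j + \frac{1}{m-1}\Big)_+ .$$

Let $p_k = \max_{j \in \{1,\dots,m\}} p_j$. Here $f_j^* = -1/(m-1)$ for all $j \neq k$, and $f_k^* = 1$. Let $J^+(k) = \{j \neq k : f_j \ge -1/(m-1),\ j = 1,\dots,m\}$
and $J^-(k) = \{j \neq k : f_j < -1/(m-1),\ j = 1,\dots,m\}$. Write

$$\Delta(f) = L(f) - L(f^*) = \sum_{j \neq k} (1 - p_j) \Big(f_j + \frac{1}{m-1}\Big)_+ + (1 - p_k) \Big( \Big(f_k + \frac{1}{m-1}\Big)_+ - 1 - \frac{1}{m-1} \Big).$$

We first consider the case $f_k \ge -1/(m-1)$. Here,

$$\Delta(f) = (1 - p_k)(f_k - 1) + \sum_{j \neq k} (1 - p_j) \Big(f_j + \frac{1}{m-1}\Big)_+ .$$

The zero-sum constraint $\sum_{j=1}^{m} f_j = 0$ simply implies $f_k - 1 = -\sum_{j \neq k} (f_j + \frac{1}{m-1})$. Divide the sum
into the sets $J^+(k)$ and $J^-(k)$ to obtain

$$\Delta(f) = \sum_{j \in J^+(k)} (p_k - p_j) \Big(f_j + \frac{1}{m-1}\Big) + (1 - p_k) \sum_{j \in J^-(k)} \Big|f_j + \frac{1}{m-1}\Big| .$$

For the case $f_k < -1/(m-1)$, observe that

$$1 + \frac{1}{m-1} = \sum_{j \neq k} \Big(f_j + \frac{1}{m-1}\Big) + f_k + \frac{1}{m-1} < \sum_{j \neq k} \Big(f_j + \frac{1}{m-1}\Big)$$

to obtain

$$\Delta(f) = \sum_{j \neq k} (1 - p_j) \Big(f_j + \frac{1}{m-1}\Big)_+ - (1 - p_k) \Big(1 + \frac{1}{m-1}\Big)$$
$$\ge \sum_{j \neq k} (1 - p_j) \Big(f_j + \frac{1}{m-1}\Big)_+ - (1 - p_k) \sum_{j \neq k} \Big(f_j + \frac{1}{m-1}\Big)$$
$$= \sum_{j \in J^+(k)} (p_k - p_j) \Big(f_j + \frac{1}{m-1}\Big) + (1 - p_k) \sum_{j \in J^-(k)} \Big|f_j + \frac{1}{m-1}\Big| .$$

In both cases clearly $L(f) - L(f^*)$ is always non-negative since $p_k - p_j$ is non-negative for all $j \neq k$.
It follows that

$$R(f) - R(f^*) = \sum_{k=1}^{m} \int \big(L(f) - L(f^*)\big)\, 1\{p_k = \max_{j=1,\dots,m} p_j\}\, dQ$$

is always non-negative, with $Q$ the unknown marginal distribution of $X$. □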


Proof of Lemma 3. Let $\tau$ be defined as in (8). We write $L(f(x)) = \mathbb{E}_{Y|X}[l(Y, f(X)) \mid X = x]$ and
recall that $p_j(x) = P(Y = j \mid X = x)$ for all $j = 1,\dots,m$, and that $f = (f_1,\dots,f_m)$ with $\sum_{j=1}^{m} f_j = 0$.
From the proof of Lemma 1, clearly


m 


(Lf) -L(f*)) Up, = max p) > tE- EI- F, 
ie jzk 2z 


peel 





where the second inequality is obtained from the fact that $|f_k - f_k^*| \le \sum_{j \neq k} |f_j - f_j^*|$. That is, the
excess risk is lower bounded by


1 a 
= tf;-f;|dQ . 
bf tii 


It implies that, for all z > 0, 
2k Z = K 2k 
RN- 233 [iso f_ ir-ra]. 
j=l TKZ 


Since $|f_j - f_j^*| \le M$ for all $j$, and by Condition AA, the second integral in the inequality above can
be upper bounded by $M(Cz)^{1/\gamma}$. Thus, for all $z > 0$,


' z i z 
R)-R > SY fii- Fldo — gmMeo'. 
I= 
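We take $z = \big( \sum_{j=1}^{m} \int |f_j - f_j^*| \, dQ \big)^{\gamma} \big/ \big( C\,(mM(1/\gamma + 1))^{\gamma} \big)$ when $\gamma > 0$, and $z \uparrow 1/C$ when $\gamma = 0$. □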


Symmetrization lemma (Pollard, 1984, Lemma II.3.8). Let $\{Z(t) : t \in T\}$ and $\{Z'(t) : t \in T\}$ be
independent stochastic processes sharing an index set $T$. Suppose there exist constants $\beta > 0$ and
$\alpha > 0$ such that $P(|Z'(t)| \le \alpha) \ge \beta$ for every $t \in T$. Then

$$P\big( \sup_{t \in T} |Z(t)| > \epsilon \big) \le \beta^{-1} P\big( \sup_{t \in T} |Z(t) - Z'(t)| > \epsilon - \alpha \big).$$